Approximate inference via variational sampling 



We propose a new method to approximately integrate a function with respect 
to a given probability distribution when an exact computation is intractable. The 
method is called "variational sampling" as it involves fitting a simplified distribution 
for which the integral has a closed-form expression, and using a set of randomly 
sampled control points to optimize the fit. The novelty lies in the chosen objective 
function, namely a Monte Carlo approximation to the generalized Kullback-Leibler 
divergence, which differs from classical methods that implement a similar idea, such 
as Bayesian Monte Carlo and importance sampling. We review several attractive 
mathematical properties of variational sampling, including well-posedness under a 
simple condition on the sample size, and a central limit theorem analogous to the 
case of importance sampling. We then report various simulations that essentially 
show that variational sampling has the potential to outperform existing methods 
within comparable computation time in estimating moments of order up to 2. We 
conclude with a brief discussion of desirable enhancements. 

1 Introduction 

The general goal of statistical inference is to summarize a multivariate probability distribution 
by a set of parameters of interest such as the normalization constant (if unknown), the mean, 
the variance matrix... In many cases, such a summary takes the form of a vector- valued integral: 



where (p : M, d —> R n is a feature function, and p : M, d — >> R + is the target distribution representing, 
for instance, a Bayesian posterior given some data (omitted from the notation) or a model 
evidence. In the frequent situation where is analytically intractable, one has to resort to 
numerical methods, which may be classified as being either sampling-based or fitting-based. 

Traditional fitting methods such as Newton-Cotes and Gaussian quadrature rules are compu- 
tationally prohibitive in any but the smallest dimensions. Sampling, or Monte Carlo integration 
methods were developed from the 40 's to overcome this curse of dimensionality [TT| IT]. The ba- 
sic idea is to sample random points xi, . . . , xjv independently from an instrumental distribution 
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7r(x), and estimate the desired integral by: 

= !T T y2 w (xk)(t)(xk), with w(x) = ^\, (2) 

TV ^ 7r(x) 

where the weights w(xfc) guarantee that the resulting estimator is unbiased. This simple method, 
known as importance sampling (IS), may however be statistically inefficient if high density 
regions of p(x) are not sampled. This problem has motivated a number of refinements that may 
be understood as variance reduction devices, such as defensive and multiple IS [T7], adaptive 
IS 0, path sampling [9], annealed IS [14J, and nested sampling [21]. Methods from a more 
radical trend are explicitly aimed to sample from the target distribution, and include both 
independence sampling methods such as rejection sampling [3] or population Monte Carlo [6J, 
and Markov chain Monte Carlo (MCMC) methods [221H1I2]. 

In the meantime, fitting-based methods have been developed or revisited for computational 
statistics. For instance, the Laplace method [23J fits a Gaussian using the second-order Taylor 
expansion of the distribution logarithm at the mode. Bayesian quadrature [161 EH] interpolates 
the target distribution (or part of the integrand) using Gaussian processes. Along with the 
development of graphical models, "message-passing" methods have emerged, such as the vari- 
ational Bayes algorithm [T5J [3] , belief propagation and expectation propagation [T2| H]. The 
common thread of message-passing methods is to work on factor graph-structured distributions, 
by adjusting factors in a sequential fashion rather than optimizing a global (and generally in- 
tractable) fitting measure [15] . 

Generally speaking, fitting-based methods tend to be fast but rely on restrictive assumptions 
regarding the target distribution, while sampling-based methods are widely applicable but may 
require many function evaluations to reach acceptable accuracy in practice. Combining both 
approaches gives rise to a class of hybrid methods that we call variational sampling in this paper. 
A widespread implementation of variational sampling involves fitting a parametric model by 
maximum weighted likelihood such as, e.g., in the Monte Carlo EM algorithm [23], the cross- 
entropy method [8J, or stochastic approximation [7]. This natural approach generalizes IS, as 
shown in Section |2.1| Another previous variational sampling method is Bayesian Monte Carlo 
(BMC) [19], a simplification of Bayesian quadrature where control points are selected randomly. 

We here propose a new objective function for variational sampling, which adds an important 
consistency property to the weighted likelihood objective, namely that the integral estimator is 
exact whenever the target distribution is among the fitting distributions. It will also turn out 
numerically more stable than BMC. 



2 Variational sampling 

Our starting point is a well-known variational argument: computing the integral is equiva- 
lent, under some existence conditions, to substituting p with the distribution q that minimizes 
the inclusive Kullback-Leibler (KL) divergence [T5j : 

D(p\\q) = y P (x)log^|dx, (3) 

over the exponential family: 

20 (x) = e^ (x) , subject to / ^(x)dx = Z p , (4) 



2 



where Z v — J p is the normalization constant of p, assumed known for now. The substitution 
may yield a closed- form expression for the integral if the exponential family is, e.g., a subset of 
Gaussian distributions, discrete distributions, or products of coordinate-wise univariate distri- 
butions. While the difficulty is then to minimize ^ because the KL divergence may be just 
as intractable as ip (</>), it becomes natural to use a Monte Carlo approximation to D(p\\q) as a 
surrogate cost function. 

2.1 Weighted likelihood 

An obvious such approximation is [211 El E] • 

Liik(q) = J i Y J w (^k)^og^^, (5) 

where xi, . . . , xjv are drawn independently from an instrumental distribution 7r(x) with support 
included in the support of p(x), and w(x.j G ) = p(x&)/7r(x&) are strictly positive importance 
weights that warrant that ^ approaches D(p\\q) in the limit of large samples. Minimizing ^ 
boils down to a familiar maximum likelihood estimation problem, yielding a similar moment 
matching condition as in the continuous KL divergence case: 



gfl(x)0(x)dx = ^ — - — r y^w(xfc)0(xfc), 



which is nothing but the classical IS estimate if Z p is estimated by IS, as is natural, i.e. Z p « 
J2k w (^k)- We have thus recast IS as a variational problem, which brings little practical value 
but stresses a major flaw of IS. Arguably, ^ is a poor fitting measure since it is generally 
possible to find a distribution g, even in the exponential family, for which Luk(q) < L^(p), 
meaning that there may exist distributions that fit p better than p itself... 

2.2 Alternative cost 

To work around this inconsistency, we reformulate the fitting problem using the generalized 
KL divergence [13J: 



D*(p\\q) = J p(x) log^jdx-y p(x)dx + J g(x)rfx, 



(6) 



which is a positive- valued convex function of (p, q) for any pair of unnormalized distributions, 
and vanishes if and only if p = q. It is easy to show that minimizing D{p\\q$) subject to J q$ — Z p 
is equivalent to minimizing D*(p\\qQ) without constraint. In other words, the constraint is 
automatically satisfied when minimizing the generalized KL divergence. 
Let us now remark that D+(p\\q) is an expected loss: 



DM0) = j P(x)£(x,q)dx, 



where the "pointwise" loss £(x,q) = — log(r(x)) — 1 + r(x) is a function of the ratio r(x) = 
g(x)/_p(x). This strongly suggests the following Monte Carlo approximation as an alternative 
to @: 



N 

k 
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As depicted in Figure [TJ the loss £(x.,q) at any point x is minimized and vanishes iff q(x) — 
p(x). It follows that L(q) < L(p) for any unnormalized distribution q with equality iff q coincides 
with p at the sampling points. Hence, if p lies in the approximation space, then it is a global 
minimizer of L(q) regardless of which points are sampled and how many (we show below that 
the minimizer is unique provided that the sample is large enough). Therefore, as illustrated in 
Figure [2j minimizing Q is expected to produce more accurate integral estimates than IS if the 
target distribution is reasonably "close" to the approximation space. 
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Figure 1: Comparison of the pointwise loss functions corresponding to the classical KL diver- 
gence (gray) and the generalized KL divergence (black). 




Figure 2: One-dimensional examples of fitting using VS (blue), IS (orange) and BMC (green) 
for exponential power target distributions with shape parameter 1 (left) and 3 (right). 
See Section [3] for details. Note that the Laplace approximation is undefined on the 
left, and flat on the right. 

When choosing the approximation space as the exponential family Q, each pointwise loss 
turns out to be a smoothly convex function of 0, therefore L is a smoothly convex function of 
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9 with gradient and Hessian matrix: 



V e L = ^*(w - w), V#Vj L = — *diag(w)$ , 

where <E> is the n x N matrix with general element = (^(x&), while w and w denote the 
vectors of importance weights and fitted importance weights, respectively, i.e. Wk = p(x^)/7r(x/ c ) 
and Wk = Wkqe(^k) /p( x /c)- The Hessian may be interpreted as the Gram matrix of the rows of <E> 
for the dot product: (<J>^, <J>j) = (1/N) ^2 k w/ c ^(x/ c )0 J (xfc). While there is generally no closed- 
form solution to the minimization of Q, the following theorem provides a simple sufficient 
condition for the existence and uniqueness of a minimum. 

Theorem 2.1. If the matrix <l> has rank n, then L(9) admits a unique minimizer in M, n . 
Sketch of proof If <l> has rank n, then <l>diag(w)<l> is fully ranked, hence positive definite for 
any 9. This implies that L{9) is strictly convex and guarantees uniqueness. Existence then 
follows from the fact that L(9) is coercive in the sense that rimy^i^^ L(9) = +oc. To prove 
that, we decompose 9 via 9 = tu where t is a scalar and u a fixed unit-norm vector. $ being 
fully ranked means that u T <l> ^ 0, hence there exists at least one k for which ^^^(x^) ^ 0, 
implying that lim^ +00 ft u (xfc)/p( x fc) G {0, +00} therefore lim^+oo ^(xfc, qtu) = +00, which 
completes the proof. □ 

For many choices of <j) functions, such as a set of linearly independent polynomials, the 
condition rank(<l>) = n is verified as soon as N > n distinct points are sampled. We may then 
track the unique minimizer of L{9) using a numerical optimization scheme such as the Newton 
method with adaptive step size [T8j . 



2.3 Central limit theorem 

The cost function ^ is a special case of Monte Carlo approximations in decision-making pre- 
viously studied by Shao [20], which enjoy stochastic convergence properties when the loss is 
smoothly convex. In particular, we have a central limit theorem analogous to the case of im- 
portance sampling (to avoid technicalities, we will assume here that the sample space is finite, 
which is only a conceptual restriction since x- values are computer-encoded in practice). 



Theorem 2.2. Under the conditions of Theorem 2.1, let 9 = argmin^^n L{9) and let the 
corresponding integral estimator Ip{(j)) — I q ~{<j)). If D*(p\\qo) admits a unique minimizer 9*, 
then I p {4>) converges in distribution to I p ((f)): 

VN[I P {^) - I p m A AA(0, S), with S = / b(x) 7^ (x) ^ (x)^(x) T &. (8) 

Sketch of proof. Most of the proof rests upon Shao [20], theorem 3, implying that 9 converges in 
distribution to the minimizer 9* of D*{p\\qo). The result then follows from a Taylor expansion 
of I qe (4>) around 0*. □ 

This asymptotic behavior is to be compared with the IS estimator, which is known to be 
unbiased with variance S/s/iV, where: 



S/ S = / ^</.(x)(/»(x) T dx-I p ((/»)I p ((/»)" 

J 7T( X ) 



(9) 



While both estimation variances decrease in 0(1/7V), the most noticeable difference between 
^ and ([9]) is that E vanishes regardless of the sampling distribution 7r(x) whenever p(x) lies in 
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the approximation space, so that p = I n contrast, at most one eigenvalue of £js can cancel 
out by an appropriate choice for 7r(x), meaning that an exact estimation of I p {4>) is impossible 
using IS from the same sample if n > 1. In this regard, the proposed method, simply referred 
to as variational sampling (VS) in the sequel, appears as yet another variance reduction device 
for IS. 

2.4 Annealed variational sampling 

As shown by Q, the accuracy of VS is dependent upon the sampling distribution 7r(x) if the 
target distribution is outside the exponential family. While optimizing 7r(x) is out of the scope of 
this paper, a strategy that proves useful in practice is to adapt Neal's annealed IS method pH] . 
The basic idea is to obtain each point x& as the final state of a Markov chain z& = (x^, x|, . . .) 
having some initial distribution 7r(x) and intermediate distributions of the form 7r(x) 1-A p(x) A 
as successive transition kernels, for a decreasing "inverse temperature" parameter A — >► 0. Neal 
realized that the importance weights corresponding to this sampling scheme are remarkably 
easy to compute in the extended state space fH] . 

The VS method may be generalized by simply replacing wfak) in Q with Neal's extended- 
state importance weights w(zk). All the properties established in the case of simple sampling 
(convexity, uniqueness, central limit theorem) are maintained, and the asymptotic variance 
formula generalizes by substituting 7r(x) in Q with the "apparent" sampling distribution: 



where 7r(z) is the distribution of extended states induced by the annealed sampling procedure. 
This method may overcome a poorly specified initial sampling distribution at the cost of mul- 
tiplying the sampling time by the number of intermediate steps. 

3 Experiments 

The purpose of this section is to evaluate VS in the problem of estimating the moments of 
order 0, 1 and 2 of a multivariate distribution, in which case the number of free parameters 
depends on the sample space dimension d through n — d(d + l)/2 + d + 1. The approxima- 
tion space then contains the set of unnormalized Gaussian distributions, but also improper 
distributions since there are no explicit constraints on the ^-parameters associated with second- 
order polynomials. All the code used in these experiments was written in Python language 
based on the Scientific Python package (www.scipy.org), and is in open-source access at 
http sT//github . com/alexis-roche/ infer) 

In the simulations presented below, several target distributions are picked among exponential 
power distributions: 



where f3 is a positive-valued shape parameter, and a is a scale parameter conventionally tuned 
depending on f3 so that p a ^ has identity variance matrix 1^. Such distributions conveniently 
model different types of unimodal distributions with either sharp (/3 < 1) or flat (/3 > 1) peaks. 
Normalized Gaussian distributions are obtained in the case /3 = 2. 

We compare VS with both IS and BMC in different dimensions d, for different values of (3 
and different sample sizes N. Also, two distinct sampling strategies are considered to mimic 





2=1 
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situations where either an accurate or a crude initial approximation to the target is available. In 
one case, the sampling distribution is the Gaussian with same moments as the target distribu- 



tion, 7r = iV(0,Id). In another case, we use annealing sampling [14], as outlined in Section 2.4 



starting from tv = iV(0, 251^) and increasing the A parameter geometrically from 0.001 to 0.999 
in 1000 steps, each step comprising one Metropolis update using a Gaussian proposal with 
variance 0.0251^. Sampling time is thus about 1000 times larger in annealed sampling than 
in "matched" sampling. Using both strategies, IS was found to perform at least as well as all 
MCMC variants we tested, hence those are not reported for the sake of clarity. 

Our implementation of BMC uses an isotropic Gaussian correlation function, meaning that 
the target distribution is fitted with a mixture of Gaussian distributions with scalar variances 
vld. To save computation time, v is estimated by IS (assuming isotropic variance), rather than 
using the conventional marginal likelihood method [19J, which seemed to have little impact on 
the results. A rather large damping value = 1 is used to regularize the linear system underlying 
BMC, meaning that the fit is not a strict interpolation. This amount of damping was tuned to 
roughly optimize performance in the presented examples. 

For each combination of parameters (d,(3,N) and sampling strategy (matched or annealed), 
100 samples were simulated. All methods were run once on each sample, and the error was 
defined as the generalized KL divergence ^ between the KL-optimal Gaussian fit q* = iV(0, 1^) 
and the Gaussian fit q output by the method: e — D(q*\\q), which also reads e — D(p\\q) — 
D(p\\q+). Hence, e is the excess value of the ideal KL divergence objective. The question is 
whether IS or BMC can do a better job than VS in minimizing it. Note that e is infinite 
whenever q is improper, which occasionally happens on small samples using VS, and more 
frequently using BMC. 

Figure [3] and [4] display the simulated median errors as functions of the sample size for different 
values of the shape parameter /3, respectively in dimensions d — 1 and d — 10. Each point on 
the curves corresponds to a series of 100 simulations where the sample size is a multiple N — kn 
of the number of free parameters, for k = 1, 2, 4, 8, 16, 32 (n = 3 if d = 1 and n = 66 if d = 10). 
Recall that, according to Theorem 2J_, the VS estimator is well-defined as soon as N > n. 

The results confirm that VS is exact for normal distributions, but also show that it has 
the potential to outperform both IS and BMC for distributions that depart significantly from 
normality (/? = 1 and /3 = 3) provided that sufficiently large samples are used. Similarly to 
IS, the VS error is seen to decrease as a function of the sample size. In contrast, BMC breaks 
down beyond a certain sample size depending on the dimension, and tends to perform poorly in 
multiple dimension, as already noted in [19J. This is explained by the fact that the linear system 
underlying BMC gets bigger as samples increase in size, and may end up ill-conditioned for large 
samples. This issue is avoided in both VS and IS (if interpreted as a variational method, see 
Section 2.1), since the number of free parameters is determined by the dimension as opposed 
to the sample size. 

In the idealistic case of matched sampling, VS is significantly slower than IS since evaluating 
the target distribution is relatively cheap in our examples, so that sampling time is small 
compared to fitting time. A key observation, however, is that the computational overhead of 
VS decreases with sample size, as shown in Tables [I] and [2j while BMC shows the opposite 
trend. In annealed sampling, sampling time becomes predominant and differences in timing 
between methods are dramatically reduced. 
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Figure 3: Median errors of VS (blue), IS (orange) and BMC (green) for exponential power 
distributions in dimension 1 with shape parameter f3 = 1,2,3 from left to right. Top 
row: matched sampling, bottom row: annealed sampling. 
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Figure 4: Same as Figure |3] in dimension 10. Errors larger than the y-axis range are not repre- 
sented (VS errors vanish for (3 = 2). 
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Sample size 


3 


6 


12 


24 


48 


96 


Matched sampling 


BMC 
VS 


3.3 
5.7 


3.6 
5.4 


4.1 

5.3 


6.0 
5.5 


9.4 
5.2 


16.8 
4.9 


Annealed sampling 


BMC 
VS 


1.0 
1.0 


1.0 
1.0 


1.0 
1.0 


1.0 
1.0 


1.0 
1.0 


1.0 
1.0 



Table 1: Computation times of VS and BMC relative to IS in dimension 1. On a single processor 
Intel Core i7 720QM running at 1.60GHz, IS took about 1ms per sample with matched 
sampling (sampling time was then negligible), and between 100ms and Is with annealed 
sampling depending on sample size. 





Sample size 


66 


132 


264 


528 


1056 


2112 


Matched sampling 


BMC 
VS 


21.3 
30.0 


37.6 
23.0 


79.4 
19.6 


179.2 
14.9 


381.9 
13.7 


604.5 
9.3 


Annealed sampling 


BMC 
VS 


1.0 
1.2 


1.0 
1.1 


1.0 
1.0 


1.1 
1.0 


1.3 
1.0 


1.6 
1.0 



Table 2: Computation times of VS and BMC relative to IS in dimension 10. IS took between 1ms 
and 16ms per sample with matched sampling, and between Is and lis with annealed 
sampling. 



4 Conclusion 

In summary, the proposed VS method is an approximate moment matching technique that relies 
on few assumptions regarding the target distribution and can serve as an efficient alternative to 
conventional Monte Carlo methods. At the theoretical level, we established that the VS esti- 
mator is both well-defined under mild conditions (Theorem |2.1[), and converges in distribution 



towards the actual moments in the limit of large samples (Theorem 2.2). At the practical level, 
our experiments showed that VS can significantly outperform existing methods in the problem 
of estimating moments of order up to 2 of unimodal distributions in moderate dimension. In 
realistic scenarios where no "good" initial sampling scheme is available, VS can be fruitfully 
combined with Neal's annealed importance sampling idea [14J at low computational cost. 

An important remaining challenge for machine learning applications is to extend VS in high 
dimension. One easy way to go, in problems where the target distribution factorizes as a 
product of low dimensional factors, is to use VS in conjunction with message-passing algorithms 
[T3] to repeatedly perform factor-wise approximations. Such local applications of VS may be 
useful whenever messages lack closed- form expressions, hence VS has the potential to widen 
the application field of message-passing. In this approach, however, message-passing is used 
as a dimension reduction device that converts the globally convex minimization problem of VS 
into a series of simpler convex minimizations. While the computational advantage is clear, this 
might have a negative impact on estimation quality. 

Implementing the direct, globally convex, VS approach in high dimension first raises engineer- 
ing issues as the number of free parameters is bound to increase with dimension - for instance, 
it increases in d 2 /2 in the Gaussian fitting problem. These can be tackled by software design 
solutions such as block-alternating optimization or related techniques that can reduce memory 
load without distorting the global minimum (as long as the problem is convex). On the other 
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hand, there is the more fundamental "curse of dimensionality" issue of how large samples need 
to be for VS to reach acceptable estimation errors. Theorem 2.1 gives some insight into this 
question as it states that the VS estimator is unique provided that the sample is larger than 
the parameter dimension, itself dependent on the sample space dimension. However, it is yet 
unclear whether the uniqueness condition implies that the minimization is sensible in that the 
solution is reasonably close to the ideal KL divergence minimization ©. We see this as an open 
research issue related with the problem of diagnostics in MCMC methods pQ . 
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